Speaking Skills Talk - Victor Akinwande
August 27, 2024, 1:00pm - 2:00pm
Location: In Person - Newell-Simon 4305
Speaker: VICTOR AKINWANDE, Ph.D. Student, Computer Science Department, Carnegie Mellon University
https://home.victorakinwande.com/

HyperCLIP: Adapting Vision-Language Models with Hypernetworks

Self-supervised vision-language models trained with contrastive objectives perform better as their scale increases. Typically, the image encoder in such models is larger than the text encoder. The inference cost of the text encoder can often be amortized by precomputing a fixed set of text embeddings, but no such amortization is possible for the image encoder, which must run on every input image. This poses a challenge for deploying large vision-language models, especially in resource-constrained environments. In this talk, I will present HyperCLIP, a vision-language architecture that dynamically adapts a small image encoder using a hypernetwork. The hypernetwork learns to produce a subset of the image encoder's parameters conditioned on the text embedding, and the entire model (hypernetwork, image encoder, and text encoder) is trained jointly end-to-end. HyperCLIP increases the zero-shot accuracy of SigLIP models with small image encoders by up to 3% on ImageNet and 5% on CIFAR-100, with minimal training throughput overhead.

Presented in Partial Fulfillment of the CSD Speaking Skills Requirement

Event Website: https://csd.cmu.edu/calendar/speaking-skills-talk-victor-akinwande
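The core idea in the abstract, a hypernetwork that maps a text embedding to a subset of a small image encoder's parameters, can be sketched roughly as follows. This is a minimal illustration, not the talk's actual architecture: all class names, dimensions, and the choice of which parameter subset is generated (here, one hidden-layer weight matrix) are hypothetical assumptions for exposition.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, not taken from the talk.
TEXT_DIM = 16   # size of a text embedding
IMG_DIM = 32    # size of an image feature vector
HIDDEN = 24     # width of the adapted hidden layer

class Hypernetwork:
    """Maps a text embedding to a subset of image-encoder parameters.

    Here the generated subset is one hidden layer's weight matrix
    (HIDDEN x IMG_DIM values), a deliberate simplification of the
    idea described in the abstract.
    """
    def __init__(self):
        self.W = rng.normal(0.0, 0.02, (TEXT_DIM, HIDDEN * IMG_DIM))

    def __call__(self, text_emb):
        flat = text_emb @ self.W              # (HIDDEN * IMG_DIM,)
        return flat.reshape(HIDDEN, IMG_DIM)  # generated layer weights

class SmallImageEncoder:
    """A tiny encoder whose first layer's weights are supplied externally."""
    def __init__(self):
        # Fixed (shared) parameters, trained jointly in the real model.
        self.w_out = rng.normal(0.0, 0.02, (HIDDEN,))

    def __call__(self, image_feat, generated_w):
        h = np.tanh(generated_w @ image_feat)  # adapted layer
        return h @ self.w_out                  # scalar image-text score

# Zero-shot-style scoring: one adapted encoder per class text embedding.
hyper = Hypernetwork()
encoder = SmallImageEncoder()
class_text_embs = rng.normal(size=(3, TEXT_DIM))  # 3 hypothetical classes
image = rng.normal(size=IMG_DIM)

scores = np.array([encoder(image, hyper(t)) for t in class_text_embs])
pred = int(np.argmax(scores))
```

Because the generated weights depend only on the text embeddings, they can be precomputed once per class set, leaving only the small adapted encoder to run per image, which is the deployment benefit the abstract points to.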